{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2020年新世相公众号推文标题分类\n",
    "- [仓库链接](https://gitee.com/Xhewen/web_data_mining/tree/master/%E6%96%B0%E4%B8%96%E7%9B%B8_%E9%87%87%E9%9B%86%E5%85%AC%E4%BC%97%E5%8F%B7_Selenium)\n",
    "\n",
    "### 数据加值宣言：\n",
    "- 本项目以**新世相**微信公众号为主。\n",
    "1. 首先先搜索了跟“新世相”三个字相关的公众号有5个，并且得到了这些微信公众号的基本信息。\n",
    "2. 最后选取了**新世相**公众号，抓取了该公众号2020年发布的所有推文，共抓到了185篇推文，并用关键词来对推文进行分类，发现“照片”、“爱情”、“我们”、“交换”、“女孩”这五个关键词排名前五，通过对2020年推文标题的分类总结得出该公众号以当代年轻人的痛点和情感方面为主，开展过晚安交换等活动。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "公众号 = \"新世相\"\n",
    "fn = { \"output\" : { \"公众号_htm_snippets\": \"data_raw_src/公众号_htm_snippets_{公众号}.tsv\",\n",
    "                    \"公众号_df\": \"data_raw_src/公众号_df_{公众号}.tsv\",\n",
    "                    \"公众号_xlsx\": \"data_sets/公众号_url_{公众号}.xlsx\" } \\\n",
    "      }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 采集公众号（selenium）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from lxml.html import fromstring\n",
    "import time\n",
    "from random import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Karoja\\AppData\\Roaming\\Python\\Python37\\site-packages\\ipykernel_launcher.py:18: DeprecationWarning: use options instead of chrome_options\n"
     ]
    }
   ],
   "source": [
    "from selenium import webdriver\n",
    "from selenium.webdriver.common.desired_capabilities import DesiredCapabilities\n",
    "\n",
    "#caps=dict()\n",
    "#caps[\"pageLoadStrategy\"] = \"none\"   # Do not wait for full page load\n",
    "\n",
    "opts = webdriver.ChromeOptions()\n",
    "opts.add_argument('--no-sandbox')#解决DevToolsActivePort文件不存在的报错\n",
    "opts.add_argument('window-size=1920x3000') #指定浏览器分辨率\n",
    "opts.add_argument('--disable-gpu') #谷歌文档提到需要加上一这个属性来规避bug\n",
    "opts.add_argument('--hide-scrollbars') #隐藏滚动条, 应对些特殊页面\n",
    "#opts.add_argument('blink-settings=imagesEnabled=false') #不加载图片, 提升速度\n",
    "#opts.add_argument('--headless') #浏览器不提供可视化页面. linux下如果系统不支持可视化不加这条会启动失败\n",
    "\n",
    "opts.binary_location = r\"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe\" #\"H:\\_coding_\\Gitee\\InternetNewMedia\\CapstonePrj2016\\chromedriver.exe\"  \n",
    "\n",
    "# \"H:\\_coding_\\Gitee\\InternetNewMedia\\CapstonePrj2016\\chromedriver.exe\"  \n",
    "driver = webdriver.Chrome( chrome_options = opts) #desired_capabilities=caps,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.get(\"https://mp.weixin.qq.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 登录"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload =  {\"account\": \"863724469@qq.com\", \"password\": \"wanangn4.12\"}\n",
    "driver.find_element_by_xpath('//div[@class=\"login__type__container login__type__container__scan\"]/a').click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"account\"]').send_keys(payload['account'])\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').clear()\n",
    "driver.find_element_by_xpath('//form[@class=\"login_form\"]//input[@name=\"password\"]').send_keys(payload['password'])\n",
    "driver.find_element_by_xpath('//div[@class=\"login_btn_panel\"]/a').click()\n",
    "\n",
    "driver.find_element_by_xpath('//div[@class=\"login_btn_panel\"]/a').click()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 找到“素材管理”"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'展开'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//a[@id=\"m_open\"]')\n",
    "element.click()\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://mp.weixin.qq.com/cgi-bin/appmsg?begin=0&count=10&t=media/appmsg_list&type=10&action=list&token=1005107966&lang=zh_CN'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "driver.execute_script(\"window.scrollTo(0,document.body.scrollHeight)\")\n",
    "element = driver.find_element_by_xpath('//li[@title[contains(.,\"素材管理\")]]/a') \n",
    "# main_content = element.get_attribute('innerHTML')\n",
    "# main_content\n",
    "url_素材管理= element.get_attribute(\"href\")\n",
    "url_素材管理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.get(url_素材管理)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 新建图文消息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"新建图文消息\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "main_content\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['CDwindow-59D42DA6B1979B79A069E1CF8177347C', 'CDwindow-62469293160A5288758D21EAE1CEA6E1']\n"
     ]
    }
   ],
   "source": [
    "# 查看已打开窗口\n",
    "print (driver.window_handles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.switch_to.window(driver.window_handles[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 选择公众号"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                超链接              \n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"超链接\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "选择其他公众号\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//*[text()[contains(.,\"选择其他公众号\")]]') \n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').clear()\n",
    "driver.find_element_by_xpath('//form//div[@class=\"inner_link_account_area\"]//input[@class=\"weui-desktop-form__input\"]').send_keys(公众号)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-icon weui-desktop-icon__inputSearch weui-desktop-icon__small\"><!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <!----> <svg width=\"16\" height=\"16\" viewBox=\"0 0 16 16\" xmlns=\"http://www.w3.org/2000/svg\"><path d=\"M11.33 10.007l4.273 4.273a.502.502 0 0 1 .005.709l-.585.584a.499.499 0 0 1-.709-.004L10.046 11.3a6.278 6.278 0 1 1 1.284-1.294zm.012-3.729a5.063 5.063 0 1 0-10.127 0 5.063 5.063 0 0 0 10.127 0z\"></path></svg> <!----> <!----> <!----> <!----></div>\n"
     ]
    }
   ],
   "source": [
    "# 点放大镜搜\n",
    "element = driver.find_element_by_xpath('//button[@class=\"weui-desktop-icon-btn weui-desktop-search__btn\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/5ROs96OaibImzsCJhd0eXIzLfpoicm0RZqPBcaECvKx9e4o6VrVDZH1AqerI5ofrsROdTHUcmcN1IXyiaMtiaqH8Lw/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">新世相</strong> <i class=\"inner_link_account_wechat\">微信号：thefair2</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/ufqQw7lroLc9W9YQBibev8xJunfHPvcHdHh0H0cqwiaVTQb2zMibreM3NNFMwZVpHb0Kajbl3dSmyutfO8hHWiadkQ/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">新世相X研究所</strong> <i class=\"inner_link_account_wechat\">微信号：thefairlab</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/ib0l8DHhOSLsbd8RDcuVSShNmO4wMzoJpxPYicrxz54hP5ib1Vuwd1uxXl11NHVmD3tC2flicHuZtY72lkJrMQxOhQ/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">新世相读书会</strong> <i class=\"inner_link_account_wechat\">微信号：school-of-life-</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">服务号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/GAMZCakEIhd9d0TOBTiaqp83Prd2ibrMjOUcLe5DT1IxpibNsibQMcLywX1rHjFc4NQNjKePGlDYqhl2EI6IXibicLVg/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">唱唱反调新世相</strong> <i class=\"inner_link_account_wechat\">微信号：cyshuimo</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li><li class=\"inner_link_account_item\"><div class=\"weui-desktop-vm_primary\"><img 
src=\"http://mmbiz.qpic.cn/mmbiz_png/ysZh1iaEQ9KwThUZjTOppgvOSK7fNkBH8OeItlQmRPk1IUKic35rGRn9PibEg8MLYiaicuhkic8SGuTOBmv0Aias35Blg/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">日本新世相</strong> <i class=\"inner_link_account_wechat\">微信号：ribenxinshixiang</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div></li>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "公众号SERP = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 解析\n",
    "root = fromstring(公众号SERP) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "主 = root.xpath('//li[@class=\"inner_link_account_item\"]')\n",
    "\n",
    "account_list = []\n",
    "\n",
    "for e in 主:\n",
    "    account_nickname = e.xpath('./div/strong[@class=\"inner_link_account_nickname\"]')[0].text\n",
    "    account_wechat = e.xpath('./div/i[@class=\"inner_link_account_wechat\"]')[0].text\n",
    "    account_img = e.xpath('./div/img/@src')[0]\n",
    "    account = {\"nickname\": account_nickname, \"wechat\": account_wechat, \"img\": account_img,}\n",
    "    account_list.append(account)\n",
    "\n",
    "df_account = pd.DataFrame(account_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>nickname</th>\n",
       "      <th>wechat</th>\n",
       "      <th>img</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>新世相</td>\n",
       "      <td>微信号：thefair2</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/5ROs96OaibImzsC...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>新世相X研究所</td>\n",
       "      <td>微信号：thefairlab</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/ufqQw7lroLc9W9Y...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>新世相读书会</td>\n",
       "      <td>微信号：school-of-life-</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/ib0l8DHhOSLsbd8...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>唱唱反调新世相</td>\n",
       "      <td>微信号：cyshuimo</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/GAMZCakEIhd9d0T...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>日本新世相</td>\n",
       "      <td>微信号：ribenxinshixiang</td>\n",
       "      <td>http://mmbiz.qpic.cn/mmbiz_png/ysZh1iaEQ9KwThU...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  nickname                wechat  \\\n",
       "0      新世相          微信号：thefair2   \n",
       "1  新世相X研究所        微信号：thefairlab   \n",
       "2   新世相读书会   微信号：school-of-life-   \n",
       "3  唱唱反调新世相          微信号：cyshuimo   \n",
       "4    日本新世相  微信号：ribenxinshixiang   \n",
       "\n",
       "                                                 img  \n",
       "0  http://mmbiz.qpic.cn/mmbiz_png/5ROs96OaibImzsC...  \n",
       "1  http://mmbiz.qpic.cn/mmbiz_png/ufqQw7lroLc9W9Y...  \n",
       "2  http://mmbiz.qpic.cn/mmbiz_png/ib0l8DHhOSLsbd8...  \n",
       "3  http://mmbiz.qpic.cn/mmbiz_png/GAMZCakEIhd9d0T...  \n",
       "4  http://mmbiz.qpic.cn/mmbiz_png/ysZh1iaEQ9KwThU...  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_account"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<div class=\"weui-desktop-vm_primary\"><img src=\"http://mmbiz.qpic.cn/mmbiz_png/5ROs96OaibImzsCJhd0eXIzLfpoicm0RZqPBcaECvKx9e4o6VrVDZH1AqerI5ofrsROdTHUcmcN1IXyiaMtiaqH8Lw/0?wx_fmt=png\" class=\"inner_link_account_avatar\"> <strong class=\"inner_link_account_nickname\">新世相</strong> <i class=\"inner_link_account_wechat\">微信号：thefair2</i></div> <div class=\"weui-desktop-vm_default inner_link_account_type\">订阅号</div>\n"
     ]
    }
   ],
   "source": [
    "element = driver.find_element_by_xpath('//ul[@class=\"inner_link_account_list\"]/li')\n",
    "main_content = element.get_attribute('innerHTML')\n",
    "print(main_content)\n",
    "element.click()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n跳转_input = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/input\\')\\n跳转_a = driver.find_element_by_xpath(\\'//span[@class=\"weui-desktop-pagination__form\"]/a\\')\\n跳转_input.clear()\\n跳转_input.send_keys(2)\\n跳转_a.click()\\n'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 跳转testing\n",
    "'''\n",
    "跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "跳转_input.clear()\n",
    "跳转_input.send_keys(2)\n",
    "跳转_a.click()\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 280]\n"
     ]
    }
   ],
   "source": [
    "# 跳转上限\n",
    "l_e = driver.find_elements_by_xpath('//label[@class=\"weui-desktop-pagination__num\"]')\n",
    "l_e_int  = [int(x.text) for x in l_e] \n",
    "print (l_e_int)\n",
    "# print (l_e_int[0]==l_e_int[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]\n"
     ]
    }
   ],
   "source": [
    "pages = list(range(l_e_int[0],l_e_int[1]-255 ))\n",
    "#print(pages[0:2])\n",
    "pages = list(range(1,l_e_int[1]-255))\n",
    "print(pages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 循环"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "html_raw = dict()\n",
    "main_content =\"\"\n",
    "element = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_pages (pages):\n",
    "    for p in pages:\n",
    "        print (p,end='\\t')\n",
    "\n",
    "        跳转_input = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/input')\n",
    "        跳转_a = driver.find_element_by_xpath('//span[@class=\"weui-desktop-pagination__form\"]/a')\n",
    "        跳转_input.clear()\n",
    "        跳转_input.send_keys(p)\n",
    "        跳转_a.click()\n",
    "\n",
    "        time.sleep(45+120*random())\n",
    "\n",
    "        element = driver.find_element_by_xpath('//div[@class=\"inner_link_article_list\"]')\n",
    "        main_content = element.get_attribute('innerHTML')\n",
    "        #print(main_content)\n",
    "        html_raw[p] = main_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\t13\t14\t15\t16\t17\t18\t19\t20\t21\t22\t23\t24\t"
     ]
    }
   ],
   "source": [
    "process_pages(pages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>html_snippets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>&lt;div&gt;&lt;label class=\"inner_link_article_item\"&gt;&lt;s...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        html_snippets\n",
       "1   <div><label class=\"inner_link_article_item\"><s...\n",
       "2   <div><label class=\"inner_link_article_item\"><s...\n",
       "3   <div><label class=\"inner_link_article_item\"><s...\n",
       "4   <div><label class=\"inner_link_article_item\"><s...\n",
       "5   <div><label class=\"inner_link_article_item\"><s...\n",
       "6   <div><label class=\"inner_link_article_item\"><s...\n",
       "7   <div><label class=\"inner_link_article_item\"><s...\n",
       "8   <div><label class=\"inner_link_article_item\"><s...\n",
       "9   <div><label class=\"inner_link_article_item\"><s...\n",
       "10  <div><label class=\"inner_link_article_item\"><s...\n",
       "11  <div><label class=\"inner_link_article_item\"><s...\n",
       "12  <div><label class=\"inner_link_article_item\"><s...\n",
       "13  <div><label class=\"inner_link_article_item\"><s...\n",
       "14  <div><label class=\"inner_link_article_item\"><s...\n",
       "15  <div><label class=\"inner_link_article_item\"><s...\n",
       "16  <div><label class=\"inner_link_article_item\"><s...\n",
       "17  <div><label class=\"inner_link_article_item\"><s...\n",
       "18  <div><label class=\"inner_link_article_item\"><s...\n",
       "19  <div><label class=\"inner_link_article_item\"><s...\n",
       "20  <div><label class=\"inner_link_article_item\"><s...\n",
       "21  <div><label class=\"inner_link_article_item\"><s...\n",
       "22  <div><label class=\"inner_link_article_item\"><s...\n",
       "23  <div><label class=\"inner_link_article_item\"><s...\n",
       "24  <div><label class=\"inner_link_article_item\"><s..."
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([html_raw]).T\n",
    "df.columns = [\"html_snippets\"]\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'html_raw' (dict)\n"
     ]
    }
   ],
   "source": [
    "%store html_raw\n",
    "import pickle \n",
    "filehandler = open(\"html_raw\", 'wb') \n",
    "pickle.dump(html_raw, filehandler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>html_snippets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [html_snippets]\n",
       "Index: []"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_out = df[~df.duplicated()]\n",
    "print (len(df_out))\n",
    "df[df.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "try_again = list(df[df.duplicated()].index)\n",
    "print(try_again)\n",
    "try_again = try_again + list (set(pages).difference(set(df.index.values)))\n",
    "try_again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 暂存档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = fn [\"output\"] [\"公众号_htm_snippets\"] \n",
    "df_out.to_csv(filename.format(公众号=公众号), sep=\"\\t\", encoding=\"utf-8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6,7,6,8,8,6,9,7,7,7,8,7,8,7,9,10,8,9,5,6,7,10,11,10,"
     ]
    }
   ],
   "source": [
    "def parse_html_snippets(_snippet_):\n",
    "    root = fromstring(_snippet_) \n",
    "    title = [x.text for x in root.xpath('//div[@class=\"inner_link_article_title\"]')]\n",
    "    create_time = [x.text for x in root.xpath('//div[@class=\"inner_link_article_date\"]')]\n",
    "    link = [x for x in root.xpath('//a/@href')]\n",
    "    _df_ = pd.DataFrame({\"title\":title, \"create_time\": create_time, \"link\":link})\n",
    "    return(_df_)\n",
    "    \n",
    "l_df = []\n",
    "for p in pages:\n",
    "    _df_ = parse_html_snippets(df.loc[p,\"html_snippets\"])\n",
    "    print (len(_df_), end=\",\")\n",
    "    l_df.append(_df_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30年后，赤名莉香不再相信爱情。我也是</td>\n",
       "      <td>2020-05-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>这是男朋友绝对给不了你的18种快乐</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>“有些事可以自私点，比如快乐” | 这135万人终于想开了</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>全日本最狠的大哥正在漏尿，看完我想给他们打钱</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>“暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>故宫出了道题，全国女孩拿回家，把男朋友都玩哭了</td>\n",
       "      <td>2020-05-13</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>我家小渣男开学第一天，我乐疯了</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>你手机里一定有这张照片，是谈恋爱都比不上的浪漫瞬间</td>\n",
       "      <td>2020-05-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>“谈恋爱很作的女孩，该骂吗？”心理学家说：“别”</td>\n",
       "      <td>2020-05-11</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>“天哪，我妈真好看！”</td>\n",
       "      <td>2020-05-10</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>“爱情这种事，不能指望算命”| 占卜师见证的垮掉的爱情</td>\n",
       "      <td>2020-05-09</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            title create_time  \\\n",
       "0             30年后，赤名莉香不再相信爱情。我也是  2020-05-16   \n",
       "1               这是男朋友绝对给不了你的18种快乐  2020-05-15   \n",
       "2   “有些事可以自私点，比如快乐” | 这135万人终于想开了  2020-05-15   \n",
       "3          全日本最狠的大哥正在漏尿，看完我想给他们打钱  2020-05-14   \n",
       "4   “暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换  2020-05-14   \n",
       "5         故宫出了道题，全国女孩拿回家，把男朋友都玩哭了  2020-05-13   \n",
       "6                 我家小渣男开学第一天，我乐疯了  2020-05-12   \n",
       "7       你手机里一定有这张照片，是谈恋爱都比不上的浪漫瞬间  2020-05-11   \n",
       "8        “谈恋爱很作的女孩，该骂吗？”心理学家说：“别”  2020-05-11   \n",
       "9                     “天哪，我妈真好看！”  2020-05-10   \n",
       "10    “爱情这种事，不能指望算命”| 占卜师见证的垮掉的爱情  2020-05-09   \n",
       "\n",
       "                                                 link  \n",
       "0   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "1   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "2   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "3   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "4   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "5   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "6   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "7   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "8   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "9   http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "10  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  "
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combine the per-page tables into one and preview the first rows\n",
    "df_url_out = pd.concat(l_df).reset_index(drop=True)\n",
    "df_url_out.loc[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>value</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>全日本最狠的大哥正在漏尿，看完我想给他们打钱</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>我家小渣男开学第一天，我乐疯了</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>“天哪，我妈真好看！”</td>\n",
       "      <td>2020-05-10</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>分手时我带着367块钱离开他。3年后我月薪五万，只爱自己</td>\n",
       "      <td>2020-05-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>刘慈欣最新科幻漫画，全球首发，赠品超惊艳</td>\n",
       "      <td>2020-05-08</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>178</th>\n",
       "      <td>为了年终奖，你想不到你同事私下有多拼</td>\n",
       "      <td>2020-01-06</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>179</th>\n",
       "      <td>武林外传让我笑了14年，今天我怎么就哭了？</td>\n",
       "      <td>2020-01-04</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>“29岁我终于可以不看人脸色了。现在告诉你付出了什么”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>“30岁，为了不被催婚，我做了两件狠心事”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>我骂了10年甲方，今天我爱上他们了</td>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>117 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                              title create_time  \\\n",
       "value                                             \n",
       "3            全日本最狠的大哥正在漏尿，看完我想给他们打钱  2020-05-14   \n",
       "6                   我家小渣男开学第一天，我乐疯了  2020-05-12   \n",
       "9                       “天哪，我妈真好看！”  2020-05-10   \n",
       "11     分手时我带着367块钱离开他。3年后我月薪五万，只爱自己  2020-05-08   \n",
       "12             刘慈欣最新科幻漫画，全球首发，赠品超惊艳  2020-05-08   \n",
       "...                             ...         ...   \n",
       "178              为了年终奖，你想不到你同事私下有多拼  2020-01-06   \n",
       "179           武林外传让我笑了14年，今天我怎么就哭了？  2020-01-04   \n",
       "181     “29岁我终于可以不看人脸色了。现在告诉你付出了什么”  2020-01-03   \n",
       "182           “30岁，为了不被催婚，我做了两件狠心事”  2020-01-03   \n",
       "185               我骂了10年甲方，今天我爱上他们了  2020-01-02   \n",
       "\n",
       "                                                    link  \n",
       "value                                                     \n",
       "3      http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "6      http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "9      http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "11     http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "12     http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "...                                                  ...  \n",
       "178    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "179    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "181    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "182    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "185    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "\n",
       "[117 rows x 3 columns]"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
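   ],
   "source": [
    "# Tagging: label each title with a keyword it contains.\n",
    "# Edit this list to set your own keywords. The leading \"\" matches every title,\n",
    "# so every row starts out untagged; if a title contains several keywords,\n",
    "# the one listed last wins.\n",
    "tagging_list = [\"\", \"男孩\", \"女孩\",\n",
    "                \"人生\",\n",
    "                \"生活\",\n",
    "                \"年轻人\",\n",
    "                \"交换\", \"城市\",\n",
    "                \"晚安\",\n",
    "                \"故事\", \"快乐\",\n",
    "                \"深夜\", \"喜欢\", \"孩子\",\n",
    "                \"我们\", \"照片\",\n",
    "                \"男朋友\",\n",
    "                \"女朋友\",\n",
    "                \"爱情\",\n",
    "                \"失去\", \"为什么\"]  # overwritable\n",
    "\n",
    "v_v_list = []\n",
    "\n",
    "for tag in tagging_list:\n",
    "    # Row labels of the titles that contain this keyword\n",
    "    index_list = df_url_out[df_url_out.title.str.contains(tag)].index.tolist()\n",
    "    # One row per match: index = row label, column \"variable\" = keyword\n",
    "    v_v_pairs = pd.DataFrame({tag: index_list}).melt().set_index(\"value\")\n",
    "    v_v_list.append(v_v_pairs)\n",
    "\n",
    "# Start from the all-rows \"\" table and let each keyword overwrite the rows it matched\n",
    "df_cat = v_v_list[0]\n",
    "for d in v_v_list:\n",
    "    df_cat.update(d)\n",
    "\n",
    "# Titles that have not been tagged yet\n",
    "df_url_out.loc[df_cat.query('variable==\"\"').index]"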
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg==&mid=2651823574&idx=3&sn=74f55464e7d2d1a6f777b98428b8c5cb&chksm=f11eda4dc669535bc69f7b789df016e54ced0d7bbe3036ed1f8f1414ab6d22e00ac648f3674d#rd'"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out.loc[24].link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [title, create_time, link]\n",
       "Index: []"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check for fully duplicated rows (an empty result means there are none)\n",
    "df_url_out[df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30年后，赤名莉香不再相信爱情。我也是</td>\n",
       "      <td>2020-05-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>这是男朋友绝对给不了你的18种快乐</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>“有些事可以自私点，比如快乐” | 这135万人终于想开了</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>全日本最狠的大哥正在漏尿，看完我想给他们打钱</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>“暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>“29岁我终于可以不看人脸色了。现在告诉你付出了什么”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>“30岁，为了不被催婚，我做了两件狠心事”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>那些和医生谈恋爱的女生，为什么想分手？</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>我们分析了28个成年人：他们除了没人心疼，哪里都疼</td>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>我骂了10年甲方，今天我爱上他们了</td>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>186 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                             title create_time  \\\n",
       "0              30年后，赤名莉香不再相信爱情。我也是  2020-05-16   \n",
       "1                这是男朋友绝对给不了你的18种快乐  2020-05-15   \n",
       "2    “有些事可以自私点，比如快乐” | 这135万人终于想开了  2020-05-15   \n",
       "3           全日本最狠的大哥正在漏尿，看完我想给他们打钱  2020-05-14   \n",
       "4    “暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换  2020-05-14   \n",
       "..                             ...         ...   \n",
       "181    “29岁我终于可以不看人脸色了。现在告诉你付出了什么”  2020-01-03   \n",
       "182          “30岁，为了不被催婚，我做了两件狠心事”  2020-01-03   \n",
       "183            那些和医生谈恋爱的女生，为什么想分手？  2020-01-03   \n",
       "184      我们分析了28个成年人：他们除了没人心疼，哪里都疼  2020-01-02   \n",
       "185              我骂了10年甲方，今天我爱上他们了  2020-01-02   \n",
       "\n",
       "                                                  link  \n",
       "0    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "1    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "2    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "3    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "4    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "..                                                 ...  \n",
       "181  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "182  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "183  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "184  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "185  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...  \n",
       "\n",
       "[186 rows x 3 columns]"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_url_out[~df_url_out.duplicated()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>create_time</th>\n",
       "      <th>link</th>\n",
       "      <th>variable</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>30年后，赤名莉香不再相信爱情。我也是</td>\n",
       "      <td>2020-05-16</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>爱情</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>这是男朋友绝对给不了你的18种快乐</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>男朋友</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>“有些事可以自私点，比如快乐” | 这135万人终于想开了</td>\n",
       "      <td>2020-05-15</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>快乐</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>全日本最狠的大哥正在漏尿，看完我想给他们打钱</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>“暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换</td>\n",
       "      <td>2020-05-14</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>交换</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>181</th>\n",
       "      <td>“29岁我终于可以不看人脸色了。现在告诉你付出了什么”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>“30岁，为了不被催婚，我做了两件狠心事”</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>那些和医生谈恋爱的女生，为什么想分手？</td>\n",
       "      <td>2020-01-03</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>为什么</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>184</th>\n",
       "      <td>我们分析了28个成年人：他们除了没人心疼，哪里都疼</td>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>我们</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>我骂了10年甲方，今天我爱上他们了</td>\n",
       "      <td>2020-01-02</td>\n",
       "      <td>http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...</td>\n",
       "      <td>无法分类</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>186 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                             title create_time  \\\n",
       "0              30年后，赤名莉香不再相信爱情。我也是  2020-05-16   \n",
       "1                这是男朋友绝对给不了你的18种快乐  2020-05-15   \n",
       "2    “有些事可以自私点，比如快乐” | 这135万人终于想开了  2020-05-15   \n",
       "3           全日本最狠的大哥正在漏尿，看完我想给他们打钱  2020-05-14   \n",
       "4    “暗恋的姑娘借钱不还，这首歌却劝我认命” | 48小时交换  2020-05-14   \n",
       "..                             ...         ...   \n",
       "181    “29岁我终于可以不看人脸色了。现在告诉你付出了什么”  2020-01-03   \n",
       "182          “30岁，为了不被催婚，我做了两件狠心事”  2020-01-03   \n",
       "183            那些和医生谈恋爱的女生，为什么想分手？  2020-01-03   \n",
       "184      我们分析了28个成年人：他们除了没人心疼，哪里都疼  2020-01-02   \n",
       "185              我骂了10年甲方，今天我爱上他们了  2020-01-02   \n",
       "\n",
       "                                                  link variable  \n",
       "0    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...       爱情  \n",
       "1    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...      男朋友  \n",
       "2    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...       快乐  \n",
       "3    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...     无法分类  \n",
       "4    http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...       交换  \n",
       "..                                                 ...      ...  \n",
       "181  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...     无法分类  \n",
       "182  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...     无法分类  \n",
       "183  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...      为什么  \n",
       "184  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...       我们  \n",
       "185  http://mp.weixin.qq.com/s?__biz=MzI2OTA3MTA5Mg...     无法分类  \n",
       "\n",
       "[186 rows x 4 columns]"
      ]
     },
     "execution_count": 120,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Attach the keyword column; titles without a keyword get the label \"无法分类\" (unclassifiable)\n",
    "df_o = df_url_out.join(df_cat).replace(\"\", np.nan).fillna(\"无法分类\")\n",
    "df_o"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>variable</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>无法分类</th>\n",
       "      <td>117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>照片</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>爱情</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>我们</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>交换</th>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>女孩</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>男朋友</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>快乐</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>失去</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>喜欢</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>女朋友</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>晚安</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>人生</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>为什么</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>故事</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>城市</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>男孩</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          title\n",
       "variable       \n",
       "无法分类        117\n",
       "照片           14\n",
       "爱情            9\n",
       "我们            9\n",
       "交换            7\n",
       "女孩            6\n",
       "男朋友           4\n",
       "快乐            3\n",
       "失去            3\n",
       "喜欢            3\n",
       "女朋友           2\n",
       "晚安            2\n",
       "人生            2\n",
       "为什么           2\n",
       "故事            1\n",
       "城市            1\n",
       "男孩            1"
      ]
     },
     "execution_count": 121,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the titles per keyword, most frequent first\n",
    "df_stats = df_o.groupby(by=\"variable\").agg({\"title\":\"count\"}).sort_values(by=\"title\", ascending=False)\n",
    "df_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Name each table; the name is reused as its Excel sheet name when writing the workbook\n",
    "df_account.columns.name = \"rel_accounts\"\n",
    "df_o.columns.name = \"url_cat\"\n",
    "df_stats.columns.name = \"stats\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'stats'"
      ]
     },
     "execution_count": 123,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_df_.columns.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [],
   "source": [
    "with pd.ExcelWriter(fn[\"output\"][\"公众号_xlsx\"].format(公众号=公众号)) as writer:\n",
    "    workbook  = writer.book\n",
    "\n",
    "    for _df_ in [df_account, df_o, df_stats]:\n",
    "        _df_.to_excel(writer, sheet_name = _df_.columns.name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_url_out.to_excel(\"新世相.xlsx\", sheet_name=\"新世相\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_o.to_excel(\"新世相_分类.xlsx\", sheet_name=\"新世相_分类\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
